Accurate and lightweight MRI super-resolution via multi-scale bidirectional fusion attention network

High-resolution magnetic resonance (MR) imaging has attracted much attention because of its contribution to clinical diagnosis and treatment. However, because of noise interference and the limitations of imaging equipment, generating a satisfactory image is expensive. Super-resolution (SR) is a technique that enhances the resolution of an imaging system, and it is an effective and cost-efficient option for MR imaging. In recent years, deep learning-based SR methods have made remarkable progress on natural images but not on medical images. Most existing medical image SR algorithms focus on the spatial information of a single image but ignore the temporal correlation within a medical image sequence. We propose two novel architectures, one for a single medical image and one for sequential medical images. The multi-scale back-projection network (MSBPN) is constructed from several back-projection units of different scales, each consisting of iterative up- and down-sampling layers. The multi-scale mechanism extracts spatial information at different scales and strengthens information fusion for a single image. Based on MSBPN, we propose an accurate and lightweight Multi-Scale Bidirectional Fusion Attention Network (MSBFAN) that combines temporal information iteratively. This supplementary temporal information is extracted from the image sequence adjacent to the target image. The MSBFAN can effectively learn both the spatio-temporal dependencies and the iterative refinement process with only a small number of parameters. Experimental results demonstrate that our MSBPN and MSBFAN outperform current SR methods in terms of reconstruction accuracy and model parameter count.


Introduction
Magnetic resonance imaging (MRI) is a non-invasive medical imaging technique that offers outstanding spatio-temporal resolution and clear soft-tissue contrast. Since its invention in 1972, MRI has proven to be a versatile imaging technique and is widely used in hospitals and clinics. Unlike other imaging techniques such as computed tomography (CT) and positron emission tomography (PET), MRI does not involve X-rays or other ionizing radiation. Clinically, however, acquiring high-quality MR images usually requires patients to remain still in a narrow tube for a long time, which aggravates patient discomfort and unavoidably introduces motion artefacts that compromise image quality. Long acquisition times and the sustained increase in demand for MRI within health systems have led to concerns about cost-effectiveness.
To accelerate acquisition while preserving MR image quality, many published studies adopt super-resolution (SR) algorithms, which have been widely studied and applied in the natural image domain and require no hardware upgrade. Interpolation-based SR algorithms estimate the value of the current pixel from adjacent pixels [1,2]. Reconstruction-based SR algorithms generate the high-resolution image by incorporating prior information [3]. The introduction of compressed sensing (CS) showed that the sparsity of a signal can be exploited to recover it from far fewer samples than required by the Nyquist-Shannon sampling theorem. CS-based methods have achieved favourable performance in MR imaging [4,5]. Lingala et al. [6] show that exploiting spatio-temporal redundancy in MR image sequences can greatly improve reconstruction quality. However, one of the most significant challenges of those traditional approaches is that the reconstructed image exhibits over-smoothing and aliasing artefacts that severely degrade image quality. Furthermore, the regularization functions and their hyper-parameters are sensitive and must be selected carefully, which makes practical application difficult.
In recent years, with the rapid development of deep learning, multiple advanced SR models [7][8][9][10] have been proposed and have attracted increasing attention because of their superior performance on natural images. Unlike traditional algorithms, deep learning methods directly learn an end-to-end mapping between low- and high-resolution image pairs without specifying prior information or regularization during training. Dong et al. [11] is the pioneering work that introduced convolutional neural networks (CNNs) [12] to the SR field and confirmed the advantage of CNNs in image feature extraction. Subsequent works mainly focus on increasing model depth and width to construct more complex structures that better extract and merge feature maps. However, those methods mainly target natural image SR rather than medical image SR. Meanwhile, training and applying deeper and wider models is difficult because of their large number of parameters and high computational cost.
Benefiting from the availability of medical image datasets, researchers have employed various kinds of neural networks to enhance the quality of MR images directly [13][14][15][16][17][18][19]. Those deep learning-based methods adapt to the characteristics of MR images and perform better than natural image models. Although existing medical image SR methods have achieved significant improvements, they still suffer from several limitations. First, almost all existing methods extract information at a single scale only and neglect the other scales; this often requires many parameters because of the large kernel size and harms the accuracy of the model. Second, most existing deep learning research on MR images is based on a single image or on different image sequences, and exploits inherent image redundancy to recover lost high-frequency details but ignores the temporal correlations of the medical image sequence. Spatio-temporal dependencies are therefore not fully exploited. Finally, in pursuit of better SR performance, SR models are becoming more and more complex. However, non-attention methods [20][21][22] treat all image features equally, which prevents training deeper models and is detrimental to image reconstruction. Meanwhile, for sequential medical images, the abundant spatial features of the target slice and the supplementary temporal features from a set of adjacent slices contribute differently to the reconstruction of the target slice.
In this paper, we propose two novel networks to resolve the issues mentioned above. For a single medical image, we present a multi-scale back-projection network (MSBPN) that extracts information at different scales, which helps to reduce the number of parameters and further improves SR performance. For sequential medical images, we build on the MSBPN and propose an accurate and lightweight multi-scale bidirectional fusion attention network (MSBFAN) that explores the spatio-temporal dependencies iteratively. Specifically, we employ MSBPN to explore the abundant spatial information of the target slice and adopt ResNet [23] to extract the supplementary temporal information from a set of adjacent slices; fusion attention is then employed to filter and combine the spatial and temporal information to further improve the quality of the target slice. Our contributions include the following key innovations:
Multi-Scale Back-Projection Network for a single target MR image: We propose MSBPN to extract details at different scales through multiple up- and down-sampling layers. We combine back-projection with a multi-scale design to expose residual features at multiple scales, thereby achieving better performance and computational efficiency.
Iteratively integrating spatial and temporal information: For sequential MR images, spatial and temporal information is extracted from different sources. The spatial exploration block outputs various feature maps of the target slice, and the temporal exploration block extracts multiple sets of feature maps from adjacent slices. These different sources are fused into the HR slice iteratively. To the best of our knowledge, this is the first work in MR image SR to adequately investigate how adjacent slices can supplement the target slice with temporal information.

Multi-Scale Bidirectional Fusion Attention Network for sequential MR images: We propose MSBFAN, a bidirectional recurrent neural network based on MSBPN, which uses only a modest number of parameters to achieve state-of-the-art performance on the SR task (Fig 1). Our MSBFAN effectively boosts performance by iteratively integrating spatial and temporal information.

Related work
The key problem of image super-resolution is how to perform upsampling [24]. Based on the upsampling operations employed and their locations, the architectures of existing models can be divided into the following types. Pre-upsampling models [11] use traditional upsampling algorithms to obtain intermediate higher-resolution images and then refine them using residual learning [25,26] and recursive layers [27]. Pre-upsampling makes learning much easier; however, this approach often introduces extra noise and blurring while increasing the cost in time and space. Post-upsampling [22,28,29,30,31,32] performs most of the computation in low-dimensional space to improve computational efficiency and increases the resolution automatically at the end of the model. However, those models still require a large number of parameters because they struggle to learn the complicated mapping. Progressive upsampling [24,33] cascades upsampling modules to decompose a complex task into several simple tasks and progressively reconstructs multiple SR images, which dramatically reduces the learning difficulty. Iterative up- and down-sampling [20,34,35] applies back-projection [36] to compute the reconstruction error and then fuses it back to tune the HR image intensity. This framework can better explore the deep relationships between LR-HR image pairs. Recently, some works [37][38][39][40][41] have adopted multiple scales to fully exploit image features, but there has been no research on the fusion of iterative projection and multi-scale processing.
Benefiting from the development of deep learning, more and more researchers have presented 2D and 3D CNN models for medical images. It is well known that 3D CNNs outperform 2D CNNs in spatio-temporal feature extraction. However, 3D CNNs are more difficult to train because of the small number of high-quality training samples and their many parameters. Schlemper et al. [16] used a deep cascade of CNNs to reconstruct dynamic sequences of 2D cardiac MR images. Qin et al. [15] combined traditional iterative algorithms with CNNs and proposed a convolutional recurrent neural network. Zhao et al. [18] proposed a deep channel splitting network (CSN) with two branches used for different information transmissions. Qing et al. [17] found that combining multi-contrast information improves the reconstruction results. Zhang et al. [42] proposed squeeze-and-excitation reasoning attention networks for accurate 2D MR image SR. Although those 2D CNN models have shown excellent ability to reconstruct 2D MR images, they still lack the ability to extract temporal information. To extract the temporal features of a 3D MR volume more fully, 3D CNN models extract features from the 3D MR volume directly. One kind of 3D CNN model [43] converts existing state-of-the-art deep 2D super-resolution models into 3D versions and improves some structures, which is the most convenient and fastest way to apply 3D CNNs to MR images. However, such models struggle to strike a balance between the number of parameters and performance. Recently, Li et al. [44] presented ParalleNet, which uses parallel connections and group convolution to treat features on different channels unequally. The number of network parameters and the computational complexity can be reduced significantly while maintaining accuracy.

Multi-scale back-projection
In the case of 4x, as shown in Fig 2, we construct an end-to-end trainable architecture based on four scales (1x, 2x, 3x and 4x), after which a channel selection mechanism selects and outputs the concatenated HR features. Here, k = 1, 2, ..., n and j = 2, 3, ..., m index the k-th projection unit in each scale and the j-th layer in the projection unit, respectively. Stacked up- and down-sampling layers output the synthesized LR feature for each scale and map it to HR features, so that details at different scales can be fused into the HR image. With this design, the MSBPN integrates details at different scales, further improves performance, and reduces the number of model parameters.
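To make the idea of back-projection concrete, the following is a minimal 1D sketch of the classical, fixed-operator version: the current HR estimate is downsampled, the LR reconstruction error is computed, and that error is projected back onto the HR estimate. It is an illustration under stated assumptions, not the learned projection units of MSBPN; the nearest-neighbour `upsample`, the average-pooling `downsample` and the function names are all hypothetical stand-ins for the learned up- and down-sampling layers.

```python
def upsample(signal, scale):
    """Nearest-neighbour upsampling: repeat every sample `scale` times."""
    return [v for v in signal for _ in range(scale)]


def downsample(signal, scale):
    """Average pooling over non-overlapping windows of length `scale`."""
    return [sum(signal[i:i + scale]) / scale for i in range(0, len(signal), scale)]


def back_project(hr_estimate, lr_observed, scale):
    """One back-projection step: push the LR reconstruction error back to HR."""
    lr_synth = downsample(hr_estimate, scale)            # synthesized LR signal
    error = [obs - syn for obs, syn in zip(lr_observed, lr_synth)]
    correction = upsample(error, scale)                  # error mapped to HR space
    return [h + c for h, c in zip(hr_estimate, correction)]


if __name__ == "__main__":
    lr = [1.0, 3.0, 2.0]
    for scale in (1, 2, 3, 4):                           # the four scales used for 4x SR
        hr0 = [0.5] * (len(lr) * scale)                  # deliberately poor initial estimate
        hr1 = back_project(hr0, lr, scale)
        residual = max(abs(a - b) for a, b in zip(downsample(hr1, scale), lr))
        print(f"scale {scale}x: residual after one step = {residual}")
```

With these particular fixed operators a single step already makes the downsampled HR estimate agree with the LR observation. In a learned network the operators are convolutions, so the projection error carries information that the network can exploit over several stacked units.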

Overall network architecture
The overall structure of the proposed MSBFAN model is illustrated in Fig 3. The operation of MSBFAN can be divided into three parts: initial feature extraction, spatio-temporal attention modules (STAMs) and reconstruction. First, the initial feature extraction module extracts the shallow features of the input sequence of LR images $\{\ldots, I_{t-1}, I_t, I_{t+1}, \ldots\}$, where $I_t$ is the target slice. These shallow features are then passed through the spatio-temporal attention modules iteratively to generate the hierarchical features $\{S^f_t; S^b_t\}$ and the output SR images $\{H_t\}$. Finally, the output SR images are collected by the reconstruction module to generate the final SR image.
Initial feature extraction. The initial feature extraction module consists of a 3 × 3 convolution layer and an activation layer. Denote $F_E(\cdot)$ as the feature extraction function; then, for the target slice $I_t$, the extracted shallow features $S_t$ can be represented as $S_t = F_E(I_t)$. For each neighbouring slice in $\{\ldots, I_{t-1}, I_{t+1}, \ldots\}$, we simply concatenate $I_t$ with $I_{t+k}$, and the extracted shallow features $T_{t+k}$ can be represented as $T_{t+k} = F_E([I_t, I_{t+k}])$, where $[\cdot, \cdot]$ denotes concatenation.

Fig 3. The overall structure consists of three parts: initial feature extraction $F_E(\cdot)$, spatio-temporal module $F_{STAM}(\cdot)$ and reconstruction $F_R(\cdot)$. The horizontal line is based on our MSBPN and explores the spatial information of the target slice. The vertical line computes the residual features from a pair of target and neighbour slices to explore the temporal information. In each spatio-temporal attention module, the spatial information and the temporal information are connected and enhanced to recover the missing details. https://doi.org/10.1371/journal.pone.0277862.g003

PLOS ONE

Spatio-temporal attention module. Our proposed STAM is illustrated in Fig 4. The STAM is composed of a temporal exploration block (TEB), a spatial exploration block (SEB), a spatio-temporal attention block (STAB) and a downsampling block (DB). Here, abundant spatial information is extracted by the SEB, and temporal information is extracted by the TEB. The STAB, which integrates the SEB and TEB paths, extracts the missing details of the target slice and then produces a refined HR feature. This module receives $S^f_{t-1}$, $S^b_{t+1}$ and $T_t$, and outputs $S^f_t$, $S^b_t$ and $H_t$.
Temporal exploration block. Similar to ResNet, we stack several residual groups, each containing two residual layers, to form a very lightweight network. Denote $F_T(\cdot)$ as the TEB function; then, for shallow temporal features $T_{t+k}$, the output of the k-th TEB is $F_T(T_{t+k})$.
Spatial exploration block. We propose a multi-scale back-projection network that stacks multi-scale projection units, each containing several up- and down-sampling layers, to expose the projection errors at different scales. Denote $F_S(\cdot)$ as the SEB function; then, for shallow spatial features $S_{t+k}$, the output of the k-th SEB is $F_S(S_{t+k})$.
Spatio-temporal attention block. The STAB receives and concatenates $F_T(T_{t+k})$ and $F_S(S_{t+k})$, then produces refined periodical HR features through the spatio-temporal fusion attention. Denote $F_A(\cdot)$ as the spatio-temporal fusion attention function; the produced HR features $H_{t+k}$ can be represented as $H_{t+k} = F_A([F_T(T_{t+k}), F_S(S_{t+k})])$.
Downsampling block. The DB downsamples the HR features $H_{t+k}$ and then outputs $S^f_{t+k}$ and $S^b_{t+k}$. Denote $F_D(\cdot)$ as the downsampling function; $S^f_{t+k}$ and $S^b_{t+k}$ are obtained by applying $F_D$ to $H_{t+k}$. Therefore, supposing the medical image sequence has n + 1 slices $I_{t+k}$, $k \in [1, n]$, with $I_t$ as the target slice, the output of the last STAM is obtained by applying the steps above iteratively over the sequence. These periodical HR features constitute the final output of our STAMs.
Reconstruction. The final SR output is generated by feeding the concatenated HR features from all STAMs into a reconstruction module $F_{rec}(\cdot)$. In our model, $F_{rec}$ is a single convolution layer with a kernel size of 3 × 3.
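The data flow through the modules described above can be sketched as follows. This is a deliberately simplified, single-direction illustration, not the authors' implementation: the bidirectional streams $S^f$ and $S^b$ are collapsed into one state, features are plain floats, and every function (`run_stams`, `f_e`, `f_t`, `f_s`, `f_a`, `f_d`, `f_rec`) is a hypothetical toy stand-in for the corresponding learned block.

```python
def run_stams(target, neighbours, f_e, f_t, f_s, f_a, f_d, f_rec):
    """Single-direction pass over the STAMs (hypothetical simplification)."""
    spatial = f_e([target])                      # shallow spatial features of the target
    hr_features = []
    for neighbour in neighbours:
        temporal = f_e([target, neighbour])      # shallow temporal features of the pair
        hr = f_a(f_t(temporal), f_s(spatial))    # STAB fuses the TEB and SEB outputs
        hr_features.append(hr)
        spatial = f_d(hr)                        # DB output feeds the next STAM
    return f_rec(hr_features)                    # reconstruction from all HR features


# Toy stand-ins so the sketch runs; real blocks are convolutional networks.
def f_e(slices):
    return sum(slices) / len(slices)


def f_t(x):
    return x


def f_s(x):
    return x


def f_a(temporal, spatial):
    return (temporal + spatial) / 2


def f_d(x):
    return x


def f_rec(features):
    return sum(features) / len(features)


if __name__ == "__main__":
    print(run_stams(1.0, [2.0, 3.0], f_e, f_t, f_s, f_a, f_d, f_rec))
```

The point of the sketch is the ordering: each neighbouring slice contributes one temporal feature, each STAM produces one HR feature, and the downsampled result becomes the spatial input of the next module before all HR features are reconstructed together.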

Experimental results
In this section, we first introduce the training dataset and implementation details. Then we compare the different configurations of MSBPN and the whole network on SR performance. Finally, our MSBFAN model is compared with several state-of-the-art SR algorithms. We evaluate the quantitative SR result with PSNR and SSIM. In all our experiments, we focus on 4x SR factor.
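For reference, PSNR is defined from the mean squared error between the reference and the reconstructed image as $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below computes it for flat lists of pixel values; the function name `psnr` and the default peak value of 255 are assumptions for illustration, since the intensity range used in the experiments is not stated here.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf                      # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    hr = [10.0, 20.0, 30.0, 40.0]
    sr = [11.0, 19.0, 31.0, 39.0]            # every pixel off by one intensity level
    print(f"PSNR = {psnr(hr, sr):.2f} dB")
```

Because PSNR is logarithmic in the MSE, a gain of 1 dB corresponds to reducing the MSE by a factor of about 1.26.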

Dataset and implementation details
Our training dataset is constructed from the IXI dataset, which contains three subsets of MR images: 578 PD volumes, 581 T1 volumes and 578 T2 volumes. For each subset, we divided the data into training, testing and validation sets in a ratio of approximately 100:10:1. We select and clip these three types of 3D volumes to a size of 240 × 240 × 91 (height × width × depth) and then generate 47985, 51548 and 51184 2D training examples and 6855, 7364 and 7312 7-slice training examples, respectively. We also apply augmentation, such as flipping and rotation. To generate the LR images, we downscale the HR images with bicubic interpolation. For the TEB, we construct nine blocks, each consisting of two 3 × 3 convolutional layers. The up- and down-sampling layers in the TEB and DB use an 8 × 8 kernel with stride 4 and padding of 2 pixels. For the SEB, we construct projection units at four scales (1x, 2x, 3x, 4x), where each projection unit consists of three up-sampling layers and two down-sampling layers (n = 1, m = 2). For the 1x projection unit, the up- and down-sampling layers use a 3 × 3 kernel with stride 1 and padding of 1 pixel; for the 2x projection unit, a 6 × 6 kernel with stride 2 and padding of 2 pixels; for the 3x projection unit, a 7 × 7 kernel with stride 3 and padding of 2 pixels; and for the 4x projection unit, the same configuration as in the TEB and DB. The number of feature maps is set to $c_t = c_s = 64$.
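The kernel, stride and padding settings above can be checked against the standard output-size formulas for strided convolution and transposed convolution. The sketch below assumes no dilation and no output padding (the usual defaults); the function names and the `CONFIGS` table are illustrative, with the numeric values taken from the text.

```python
def conv_out(size, kernel, stride, pad):
    """Output length of a strided convolution (no dilation)."""
    return (size + 2 * pad - kernel) // stride + 1


def deconv_out(size, kernel, stride, pad):
    """Output length of a transposed convolution (no dilation, no output padding)."""
    return (size - 1) * stride - 2 * pad + kernel


# scale -> (kernel, stride, padding), as listed in the implementation details
CONFIGS = {1: (3, 1, 1), 2: (6, 2, 2), 3: (7, 3, 2), 4: (8, 4, 2)}

if __name__ == "__main__":
    lr_size = 48                                   # LR training patch size
    for scale, (k, s, p) in CONFIGS.items():
        hr_size = deconv_out(lr_size, k, s, p)     # up-sampling layer
        back = conv_out(hr_size, k, s, p)          # matching down-sampling layer
        print(f"{scale}x: {lr_size} -> {hr_size} -> {back}")
```

Under these assumptions each configuration upsamples by exactly its scale factor and the matching down-sampling layer returns to the original size, which is what allows the projection errors at every scale to be computed on aligned feature maps.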
We train the models with a patch size of 48 × 48, cropped randomly from the 60 × 60 LR images. All models are trained end-to-end using the $L_1$ loss. The learning rate is initialized to $10^{-4}$ for all layers and decreased by a factor of 10 halfway through the total of 100 epochs. For optimization, we use Adam with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. All experiments were conducted using Python 3.8.5 and PyTorch 1.6.0 on an NVIDIA GeForce GTX 1080 Ti GPU.
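As a small illustration of the training objective and schedule, the sketch below implements a mean absolute error ($L_1$) loss over flat pixel lists and a step learning-rate schedule. It assumes one reading of the schedule, namely a single drop by a factor of 10 at the midpoint of training; the function names and the exact epoch at which the drop happens are assumptions, not details confirmed by the text.

```python
def l1_loss(prediction, target):
    """Mean absolute error between two equally sized pixel lists."""
    if len(prediction) != len(target) or not prediction:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(prediction, target)) / len(prediction)


def learning_rate(epoch, base_lr=1e-4, total_epochs=100, decay=0.1):
    """Step schedule: base rate for the first half, decayed rate afterwards (assumed)."""
    return base_lr if epoch < total_epochs // 2 else base_lr * decay


if __name__ == "__main__":
    print(l1_loss([1.0, 2.0], [2.0, 4.0]))
    for epoch in (0, 49, 50, 99):
        print(epoch, learning_rate(epoch))
```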

Model analysis
Multi-scale back-projection network. The proposed MSBPN can be configured in several ways. For comparison, we evaluated different MSBPN structures with respect to the following aspects.
Back-projection. To study the impact of different back-projection configurations, we construct multiple modules to show the trade-off between performance and the number of network parameters. Specifically, we create two kinds of modules, $M_{1,n}$ and $M_{2,n}$, to investigate the impact of the number of convolutional layers in the projection unit. We also create three other kinds of modules, $M_{m,1}$, $M_{m,2}$ and $M_{m,3}$, to investigate the impact of the number of projection units. The training and testing results are shown in Fig 5(a) and Table 1. The performance of the model improves as the network becomes deeper, where depth is mainly determined by the number of projection units m and the number of convolutional layers per projection unit n. We infer that the performance improvement of our MSBPN is mainly due to the increase in model depth. However, the depth of the model cannot be increased indefinitely. When m = 2 and n = 3, training began to be unstable, suggesting that models with complex structures and many parameters promise better performance but are harder to train fully on MR images.
Dense connection. We remove the dense connections of the MSBPN to show how they influence the performance of the model in three cases, as shown in Fig 5(b) and Table 2. Dense connections stabilize the training of deeper networks and adaptively reuse the information extracted by the current and preceding back-projection units.
Multi-scale. To demonstrate the advantage of our multi-scale mechanism, we build two kinds of networks: SS, which adopts single-scale projection units (four 4x units), and MS, which adopts multi-scale projection units (1x, 2x, 3x and 4x units). The two networks are compared in terms of performance and number of parameters. The results for 4x enlargement are shown in Fig 5(c) and Table 3. The multi-scale mechanism significantly reduces the number of model parameters without compromising performance.
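A rough back-of-the-envelope count shows where the parameter saving of the multi-scale design comes from. The sketch below counts only the kernel weights of the up- and down-sampling layers, assuming 64 input and 64 output channels per layer, five such layers per projection unit, and the kernel sizes from the implementation details. Biases and all other layers are ignored, so the totals are illustrative and are not the parameter counts reported in Table 3.

```python
CHANNELS = 64                                  # feature maps per layer
KERNEL = {1: 3, 2: 6, 3: 7, 4: 8}              # scale -> kernel size
LAYERS_PER_UNIT = 5                            # three up-sampling + two down-sampling layers


def unit_weights(scale):
    """Kernel weights of one projection unit at the given scale (biases ignored)."""
    k = KERNEL[scale]
    return LAYERS_PER_UNIT * k * k * CHANNELS * CHANNELS


single_scale = 4 * unit_weights(4)                        # SS: four 4x units
multi_scale = sum(unit_weights(s) for s in (1, 2, 3, 4))  # MS: 1x, 2x, 3x and 4x units

if __name__ == "__main__":
    print(f"SS: {single_scale:,} weights")
    print(f"MS: {multi_scale:,} weights")
    print(f"MS / SS = {multi_scale / single_scale:.3f}")
```

Under these assumptions the multi-scale configuration needs roughly 62% of the sampling-layer weights of the single-scale one, because the 1x, 2x and 3x units use smaller kernels than the 8 × 8 kernel required at 4x.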
Multi-scale bidirectional fusion attention network. In this part, we validate several components of the proposed MSBFAN and mainly focus on the use of temporal information.
Baselines. We consider three baselines with different ways of fusing spatial and temporal information. First, we simply concatenate all slices (7 slices) as the input of the SEB, which introduces temporal information without exploring it sufficiently. Second, we remove the $S^b_t$ stream and keep only $S^f_t$, which turns off the backward temporal connection. Third, we remove the $S^f_t$ stream and keep only $S^b_t$, which turns off the forward temporal connection. The testing results are shown in Table 4. The results of SEB (1 slice) and SEB (7 slices) show that information extracted from neighbouring slices contributes to the image reconstruction. The combination of spatial and temporal information is also important. The full MSBFAN model achieves 31.41 dB, which is 0.37 dB, 0.08 dB and 0.09 dB better than SEB (7 slices), MSBFAN (forward) and MSBFAN (backward), respectively.
Slice length. We evaluated MSBFAN with MR image sequences of different lengths. Fig 6 shows that performance improves as more slices are used. As the figure shows, the model improves most when the number of slices increases from one to two and three, because the slices closest to the target slice have the highest correlation with it. The performance of MSBFAN/4 is even better than that of RBPN/7, the reference model. We expect that the performance of our MSBFAN will improve further as the number of slices increases.
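One plausible way to build a sequence of a given length is to take the slices nearest to the target first, alternating between the two sides, which matches the observation that the closest slices are the most correlated. The sketch below is only an illustration of that selection rule; the text does not specify how slices are chosen for shorter sequences, and the function name `nearest_neighbours` and its boundary handling are assumptions.

```python
def nearest_neighbours(target, count, depth):
    """Return up to `count` slice indices closest to `target`, nearest first.

    `depth` is the number of slices in the volume; indices outside
    [0, depth) are skipped, so slices near the boundary draw more
    neighbours from the available side.
    """
    picked = []
    offset = 1
    while len(picked) < count and (target - offset >= 0 or target + offset < depth):
        for idx in (target - offset, target + offset):
            if 0 <= idx < depth and len(picked) < count:
                picked.append(idx)
        offset += 1
    return picked


if __name__ == "__main__":
    # A 7-slice sequence is the target slice plus six neighbours.
    print(nearest_neighbours(45, 6, 91))
    print(nearest_neighbours(0, 3, 91))
```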
Slice order. When selecting neighbouring slices from the MR image sequence to support the target slice $I_t$, there are several options. We consider three cases: using only the past two slices ($I_{t-2}$, $I_{t-1}$), named P; using only the future two slices ($I_{t+1}$, $I_{t+2}$), named F; and using one past slice ($I_{t-1}$) and one future slice ($I_{t+1}$), named PF. P denotes a network trained and tested on P, and P → F denotes a network trained on P and tested on F. The results are shown in Table 5. Our intuition suggests, and the results confirm, that PF is better than P and F by 0.19 dB, since the nearest slices carry more supplementary information about the target slice. P is better than P → F by 0.25 dB, and F is better than F → P by 0.21 dB, which indicates that the MR image sequence is not symmetric. The similar performance of P and F indicates that the model is robust and insensitive to the order of the MR image sequence.
Ablation study. To verify the superiority of our MSBFAN, we investigate the basic network modules: the SEB and the STAB. To demonstrate the effect of our SEB, we use DBPN instead of our MSBPN (denoted M→D for short). To demonstrate the effect of our STAB, we use residual learning instead of our spatio-temporal fusion attention (denoted FA→RL for short). We further show the effect of optical flow (OF) on the performance of our model. Table 6 shows the ablation results for the three changes described above. Comparing the second line with the last line, we find that the model with MSBPN performs better

Comparison with other methods
To verify the effectiveness of the proposed MSBPN and MSBFAN more rigorously, we compare them with several advanced SR algorithms: VDSR [25], LapSRN [24], EDSR [21], DBPN [20], RDN [22], CSN [18], RCAN [28], SAN [45], HAN [46], Swin [47], LBNet [48], RBPN [34] and BasicVSR [49]. All models are retrained with the same training configuration on the three generated datasets. Each dataset has a different focus and different characteristics. Table 7 shows that our MSBPN performs less than satisfactorily at small scale factors. This is because fewer projection scales are used at small scale factors. For example, when the scale factor is 3, the scales used are 1x, 2x and 3x. This is done to effectively reduce the number of model parameters. Compared with RDN, our MSBPN achieves comparable performance with a small number of parameters for scale factor 4. However, all these single-slice methods perform worse than the proposed multi-slice MSBFAN, indicating the superiority of the proposed method. Specifically, for scale factor 4, the PSNR achieved by our model on the three datasets is higher than that of HAN by 0.22 dB, 0.21 dB and 0.32 dB, respectively. This is because spatial information at different scales is well explored by our MSBPN and more supplementary temporal information is aggregated into the features of the target slice, which greatly helps in reconstructing high-quality images. Our MSBFAN achieves better accuracy than the multi-slice models RBPN and BasicVSR, even though RBPN has more than twice as many parameters. Note that RBPN and BasicVSR failed to train for scale factor 2. This shows that simply applying natural image algorithms to medical images is not feasible. Fig 7 displays the qualitative results on three scenarios from the PD, T1 and T2 datasets, respectively. It can be observed from the zoomed-in regions that our model reconstructs richer and more authentic details and is overall the most similar to the ground truth.
The first and second rows show the result on a PD image. There is a lot of texture at the position indicated by the red arrow, which is not completely reconstructed in the results of the other models, whereas our model gives a relatively comprehensive and clear reconstruction. The third and fourth rows show the result on a T1 image. Similar to PD, there is a black ridge at the position indicated by the red arrow, which divides the area into smaller areas. Only our MSBFAN can

Conclusion
In this work, we have proposed a novel multi-scale back-projection network (MSBPN) for a single target MR image, primarily made up of back-projection units at different scales that extract abundant spatial information. Inspired by video super-resolution, we also presented a multi-scale bidirectional fusion attention network (MSBFAN) to integrate the spatial and temporal information of sequential medical images. The temporal information is explored from the medical image sequence surrounding the target slice and iteratively integrated with the spatial information, yielding gradual refinement of the high-resolution features that are eventually used to reconstruct the high-resolution target slice. In extensive experiments, we verify the contribution of the various design choices to the final performance of our model and demonstrate that, on the IXI dataset, MSBFAN achieves significant performance advantages over most existing SR methods.